Below we perform an in-depth analysis of Oscars data to surface meaningful insights. We'll try to answer the following questions through various analyses of the data:
Can we come up with a model that predicts the winner among the nominees, or determines their success factors? Has there been a change over time; is there an obvious pattern (e.g., a trend for some notable groups)? In more detail:
Do IMDB ratings (audience taste) agree with the opinions of Academy members?
Does budget necessarily play an important role?
Does the diversity, or lack of diversity, of cast and crew in terms of race, gender, sexual orientation, etc. show a pattern of some sort?
Which movie genres and categories are more likely to win the award?
We'll take into consideration Actor, Actress, and Director data collected over the years for our analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, log_loss, classification_report
from collections import Counter
import warnings
import copy
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 30)
%matplotlib inline
Data preparation involves a number of steps to get the data ready for the various tasks.
Below we load the dataset of nominations for Actors, Actresses, and Directors across all years, along with winner information and the film each nomination belongs to.
In the original dataset, the Name and Film columns were swapped after a particular entry ("Cimarron"), so the code below corrects that as well.
Data Source: https://www.kaggle.com/theacademy/academy-awards
main_df = pd.read_csv('database.csv')
award_categories = ['Actor', 'Actress', 'Directing']#, 'Directing (Comedy Picture)', 'Directing (Dramatic Picture)', 'Documentary']
actor_nomination = main_df[main_df.Award == 'Actor'].copy()
actress_nomination = main_df[main_df.Award == 'Actress'].copy()
directing_nomination = main_df[main_df.Award == 'Directing'].copy()
## Below code lines up Film and Name. The two columns appear to have been flipped
## starting from the 'Cimarron ' entry.
names, films = [], []
cimarron_detected = False
for name, film in zip(directing_nomination.Name, directing_nomination.Film):
    if name == 'Cimarron ':
        cimarron_detected = True
    if cimarron_detected:
        names.append(film)  ## From here on the Film column actually holds the name, and vice versa.
        films.append(name)
    else:
        names.append(name)
        films.append(film)
directing_nomination['Name'] = names
directing_nomination['Film'] = films
#print(actor_nomination.shape[0]+actress_nomination.shape[0]+directing_nomination.shape[0])
main_df = pd.concat([actor_nomination, actress_nomination, directing_nomination])
main_df = main_df.drop(['Ceremony'], axis=1)
main_df['Film'] = main_df.Film.str.strip()
main_df['Winner'] = main_df['Winner'].fillna(0)
print('Total Data Size : ',main_df.shape)
main_df.head()
Below we load data on winning actors and their movies. It also contains the movie duration, the number of nominations that year, and genre data.
Data Source: https://cs.uwaterloo.ca/~s255khan/oscars.html
winning_actors = pd.read_csv('actors.csv')
winning_actors['name'] = winning_actors.name.str.strip()
winning_actors = winning_actors.drop(['synopsis'], axis=1)
print('Winning Actors Size : ',winning_actors.shape)
winning_actors.head()
Below we load data on winning actresses and their movies. It also contains the movie duration, the number of nominations that year, and genre data.
Data Source: https://cs.uwaterloo.ca/~s255khan/oscars.html
winning_actresses = pd.read_csv('actresses.csv')
winning_actresses['name'] = winning_actresses.name.str.strip()
winning_actresses = winning_actresses.drop(['synopsis'], axis=1)
print('Winning Actresses Size : ',winning_actresses.shape)
winning_actresses.head()
Below we load data on winning directors and their movies. It also contains the movie duration, the number of nominations that year, and genre data.
Data Source: https://cs.uwaterloo.ca/~s255khan/oscars.html
winning_directors = pd.read_csv('directors.csv')
winning_directors['name'] = winning_directors.name.str.strip()
winning_directors = winning_directors.drop(['synopsis'], axis=1)
print('Winning Directors Size : ',winning_directors.shape)
winning_directors.head()
Below we load data on winning movies. It also contains the movie duration, the number of nominations that year, and genre data.
Data Source : https://cs.uwaterloo.ca/~s255khan/oscars.html
winning_movies = pd.read_csv('pictures.csv')
winning_movies['name'] = winning_movies.name.str.strip()
winning_movies = winning_movies.drop(['synopsis'], axis=1)
print('Winning Movies Size : ',winning_movies.shape)
winning_movies.head()
Below we load the demographics dataset, which contains Oscar winners' (actors, actresses, and directors) birthplace, ethnicity, religion, and sexual orientation.
Data Source : https://data.world/crowdflower/academy-awards-demographics
demographics_data = pd.read_csv('Oscars-demographics-DFE.csv', encoding='latin1')
#print(demographics_data.columns)
cols = ['award','movie','person','birthplace','date_of_birth','race_ethnicity','religion','sexual_orientation']
demographics_data = demographics_data[cols]
demographics_data['movie'] = demographics_data.movie.str.strip()
print('Demographics Data : ', demographics_data.shape)
demographics_data.head()
demographics_data['award'].unique()
Below we load the movie metadata, which contains each movie's budget, gross income, IMDB score, genres, etc.
Data Source : https://data.world/popculture/imdb-5000-movie-dataset
metadata = pd.read_csv('movie_metadata.csv')
#print(metadata.columns)
cols = ['movie_title', 'budget', 'gross']
cols2 = ['movie_title', 'budget', 'gross', 'duration', 'genres', 'imdb_score']
#metadata = metadata[cols2]
metadata['movie_title'] = metadata.movie_title.str.strip()
print('Metadata Size : ',metadata.shape)
metadata.head()
Below we merge the movie metadata with the winning actors data to get each movie's budget, gross income, etc.
Notice that budget and gross income columns are added to the dataframe.
winning_actors = winning_actors.merge(metadata[cols], how='left', left_on='name', right_on='movie_title').drop(['movie_title'], axis=1)
winning_actors.head()
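The left-merge pattern used here can be illustrated on toy frames (the titles and numbers below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the winners table and the IMDB metadata table.
winners = pd.DataFrame({'name': ['Film A', 'Film B', 'Obscure Film']})
meta = pd.DataFrame({'movie_title': ['Film A', 'Film B'],
                     'budget': [200.0, 103.0],   # toy values
                     'gross': [658.0, 187.0]})

# how='left' keeps every winner row; titles missing from the metadata get NaN
# for budget/gross, which is why the fillna step is needed later.
merged = winners.merge(meta, how='left',
                       left_on='name', right_on='movie_title').drop(columns='movie_title')
print(merged)
```

With `how='left'`, unmatched rows survive with NaN instead of being dropped, so no winner is silently lost in the merge.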
Below we merge the movie metadata with the winning actresses data to get each movie's budget, gross income, etc.
Notice that budget and gross income columns are added to the dataframe.
winning_actresses = winning_actresses.merge(metadata[cols], how='left', left_on='name', right_on='movie_title').drop(['movie_title'], axis=1)
winning_actresses.head()
Below we merge the movie metadata with the winning directors data to get each movie's budget, gross income, etc.
Notice that budget and gross income columns are added to the dataframe.
winning_directors = winning_directors.merge(metadata[cols], how='left', left_on='name', right_on='movie_title').drop(['movie_title'], axis=1)
winning_directors.head()
Below we merge the movie metadata with the winning movies data to get each movie's budget, gross income, etc.
Notice that budget and gross income columns are added to the dataframe.
winning_movies = winning_movies.merge(metadata[cols], how='left', left_on='name', right_on='movie_title').drop(['movie_title'], axis=1)
winning_movies.head()
Below we merge the demographics data with the winning actors data to get birthplace, ethnicity, religion, sexual orientation, etc.
Note that birthplace, date of birth, race, religion, and sexual orientation columns are added.
winning_actors = winning_actors.merge(demographics_data[demographics_data.award == 'Best Actor'], how='left', left_on='name', right_on='movie').drop(['movie', 'award'], axis=1)
winning_actors.head()
Below we merge the demographics data with the winning actresses data to get birthplace, ethnicity, religion, sexual orientation, etc.
Note that birthplace, date of birth, race, religion, and sexual orientation columns are added.
winning_actresses = winning_actresses.merge(demographics_data[demographics_data.award == 'Best Actress'], how='left', left_on='name', right_on='movie').drop(['movie', 'award'], axis=1)
winning_actresses.head()
Below we merge the demographics data with the winning directors data to get birthplace, ethnicity, religion, sexual orientation, etc.
Note that birthplace, date of birth, race, religion, and sexual orientation columns are added.
winning_directors = winning_directors.merge(demographics_data[demographics_data.award == 'Best Director'], how='left', left_on='name', right_on='movie').drop(['movie', 'award'], axis=1)
winning_directors.head()
Below we add the demographics and movie metadata to the dataframe that holds the nominations along with the winners per year.
Note that birthplace, date of birth, race, religion, and sexual orientation columns are added.
Note: this dataframe will be used for machine learning, hence we combine all the data.
main_df = main_df.merge(metadata[cols2], how='left', left_on='Film', right_on='movie_title').drop(['movie_title'], axis=1)
main_df = main_df.merge(demographics_data, how='left', left_on='Film', right_on='movie').drop(['movie', 'award'],axis=1)
main_df.head()
Coming up with a machine learning algorithm that can predict the winner fairly accurately, and inspecting its feature importances, is a good way to show which factors play an important role in winning an Academy Award.
Below we create the dataframe for the machine learning model. We'll train a Random Forest model which, given an entry as input, predicts whether it will win the award. The following steps are needed to make the data ready for ML models:
## One-hot encoding for genres.
## One-hot encoding creates one indicator column per possible value.
## Each genre becomes a column in the dataframe, with value 1 if the movie has that
## genre and 0 otherwise.
## So for an action/animation movie, the Action and Animation columns will be 1 and
## all other genre columns 0.
genres_cols = ['Action','Adventure','Animation','Biography','Comedy','Crime','Drama','Family','Fantasy','Film-Noir','History',
'Horror','Music','Musical','Mystery','Romance','Sci-Fi','Sport','Thriller','War','Western']
genres_data = []
## The loop below goes through every row of the dataframe; for each movie, the genres it has
## get a 1 and all other genres get a 0.
## Example: genres = Action|Adventure|War gives the row [1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]
## (1 at the Action, Adventure, and War positions of genres_cols, 0 elsewhere).
for g1 in main_df.genres:  ## Loop through the pipe-delimited genre strings (see main_df above for examples).
    if isinstance(g1, str):
        genre_tokens = g1.split('|')  ## Compare whole genre names; a raw substring test would match 'Music' inside 'Musical'.
        g_row = []
        for g2 in genres_cols:
            if g2 in genre_tokens:
                g_row.append(1)  ## 1 if the movie has this genre.
            else:
                g_row.append(0)  ## 0 otherwise.
        genres_data.append(g_row)
    else:
        genres_data.append([0] * len(genres_cols))  ## Missing genre info: all zeros.
genres_data = np.array(genres_data)
ml_df = main_df.copy()  ## Duplicate dataframe whose values will be fed to the ML model.
for i, col in enumerate(genres_cols):
    ml_df[col] = genres_data[:, i]  ### Put each genre's indicator column into the dataframe.
ml_df.head()
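One subtlety worth checking in genre matching: some genre names are substrings of others, so comparing whole tokens from the pipe-delimited string is safer than a raw substring test. A minimal check:

```python
genres_str = 'Musical|Drama'  # example pipe-delimited genre string

# A raw substring test misfires: 'Music' is a substring of 'Musical'.
assert 'Music' in genres_str

# Splitting on the '|' delimiter compares whole genre names instead.
tokens = genres_str.split('|')
assert 'Music' not in tokens
assert 'Musical' in tokens and 'Drama' in tokens
print(tokens)  # ['Musical', 'Drama']
```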
## Note: the ML model cannot handle missing values, so every field must be filled in.
ml_df['race_ethnicity'] = ml_df['race_ethnicity'].fillna('Na')  ## Fill 'Na' where race is missing.
ml_df['religion'] = ml_df['religion'].fillna('Na')  ## Fill 'Na' where religion is missing.
ml_df['sexual_orientation'] = ml_df['sexual_orientation'].fillna('Na')  ## Fill 'Na' where sexual orientation is missing.
ml_df['budget'] = ml_df['budget'].fillna(ml_df['budget'].mean())  ## Fill missing budget with the mean budget.
ml_df['gross'] = ml_df['gross'].fillna(ml_df['gross'].mean())  ## Fill missing gross with the mean gross.
ml_df['duration'] = ml_df['duration'].fillna(ml_df['duration'].mean())  ## Fill missing duration with the mean duration.
ml_df['imdb_score'] = ml_df['imdb_score'].fillna(ml_df['imdb_score'].mean())  ## Fill missing IMDB score with the mean score.
ml_df.head()
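On a toy series, the mean-imputation step above behaves like this:

```python
import pandas as pd

s = pd.Series([100.0, None, 300.0])
# The missing entry is replaced by the mean of the observed values: (100 + 300) / 2 = 200.
filled = s.fillna(s.mean())
print(filled.tolist())  # [100.0, 200.0, 300.0]
```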
Below we take all columns selected for the ML model and create an array from them, with the Winner column as the target variable.
Here labels are defined as below:
0 - Nominated
1 - Winner
cols_for_model = genres_cols + ['Award', 'budget', 'gross', 'duration', 'imdb_score',
                                'race_ethnicity', 'religion', 'sexual_orientation']
### pd.get_dummies one-hot encodes the listed columns.
## Check the output to see that all entries for these columns are one-hot encoded.
ml_df_one_hot = pd.get_dummies(ml_df[cols_for_model], columns=['Award', 'race_ethnicity', 'religion', 'sexual_orientation'])
X = ml_df_one_hot.values
Y = ml_df['Winner'].values
print('Dataset Size : ',X.shape, Y.shape)
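For reference, pd.get_dummies turns each categorical column into one indicator column per distinct value, e.g. on a toy 'Award' column:

```python
import pandas as pd

df = pd.DataFrame({'Award': ['Actor', 'Actress', 'Directing']})
one_hot = pd.get_dummies(df, columns=['Award'])

# One column per distinct value; each row has exactly one 1.
print(one_hot.columns.tolist())  # ['Award_Actor', 'Award_Actress', 'Award_Directing']
print(one_hot.astype(int).values.tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```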
Below we split the data into a train set (80%) and a test set (20%). The training data is used to fit the model, and the test data is used to evaluate its performance.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2, stratify=Y, random_state=42)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
print(Counter(Y_train), Counter(Y_test))
rf = RandomForestClassifier(n_estimators=150, max_depth=None)
rf.fit(X_train, Y_train)
print('Train Accuracy : %.2f%%'% (rf.score(X_train, Y_train)*100))
print('Test Accuracy : %.2f%%'% (rf.score(X_test, Y_test)*100))
#print('\nClassification Report : ')
#print(classification_report(Y_test, rf.predict(X_test)))
conf_mat = confusion_matrix(Y_test, rf.predict(X_test))
fig = go.Figure()
fig = go.Figure(data=go.Heatmap(z=conf_mat, x= [0,1], y=[0,1]))
fig.update_layout(title="Confusion Matrix",xaxis_title="Predicted",yaxis_title="Actual")
fig.show()
Our random forest model gives quite good accuracy: about 88% on the test set and 92% on the train set.
It correctly predicted 147 non-winning entries and 41 winning entries. It made mistakes in two ways: for 7 entries it predicted a win where there was none, and for 18 entries the nominee actually won but the model predicted otherwise.
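One caveat when reading these accuracies: the classes are imbalanced (most nominees do not win), so a model that always predicts "no win" already scores well. A quick sketch of that majority-class baseline, using a hypothetical label vector with 4 losing nominees per winner:

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy achieved by always predicting the most frequent class."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

# Hypothetical labels: 4 losing nominees (0) per winner (1), as in a 5-nominee category.
y_example = np.array([0, 0, 0, 0, 1] * 40)
print('Majority baseline accuracy: %.2f%%' % (majority_baseline_accuracy(y_example) * 100))  # 80.00%
```

Comparing the model's accuracy against this baseline, rather than against 0%, gives a fairer sense of how much it has actually learned.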
Confusion Matrix Explanation: the confusion matrix summarizes how our model performed: how many entries it predicted correctly and how many it got wrong. With sklearn's convention (rows = actual class, columns = predicted class):
Confusion Matrix =
[[True Negative, False Positive],
[False Negative, True Positive]]
When you hover over a particular square of the heatmap, the count is shown as the z value.
Example:
Take the yellow cell: hovering over it shows z=147, with Actual = 0 on the Y axis and Predicted = 0 on the X axis. This means that for 147 rows the actual value was 0 and the model also predicted 0, i.e., 147 nominees lost the award and the model predicted the same. The other cells can be interpreted in the same way.
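The four cells can also be read off programmatically with sklearn; a small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0]  # made-up actual labels
y_pred = [0, 1, 0, 1, 0, 0]  # made-up predictions

# sklearn lays the matrix out as [[TN, FP], [FN, TP]]; ravel() flattens it row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 1
```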
Below we plot the model's importance weight for each feature. A very high value for a feature indicates that it contributes more to predicting whether a person/movie wins the award; higher-weight features contribute more to winning.
Note that the feature names below match the columns of the ML dataframe we created above.
ignore_cols = ['race_ethnicity_Na', 'sexual_orientation_Na', 'religion_Na']
with plt.style.context(('seaborn', 'ggplot')):
    ## Dataframe of columns and their importance weights in the model.
    feature_importance = pd.DataFrame({'Columns': ml_df_one_hot.columns, 'Feature_Importance': rf.feature_importances_.flatten()})
    ## Remove the Na placeholder entries.
    feature_importance = feature_importance[~feature_importance.Columns.isin(ignore_cols)]
    feature_importance = feature_importance.sort_values(by='Feature_Importance')
    plt.figure(figsize=(20, 5))
    plt.bar(x=range(feature_importance.shape[0]), height=feature_importance['Feature_Importance'])
    plt.xticks(range(feature_importance.shape[0]), feature_importance.Columns, rotation='vertical')
    plt.xlabel('Features')
    plt.ylabel('Feature Weight')
    plt.title('Random Forest Feature Importance Plot')
From the sorted list of feature importances above we can see which attributes contributed most heavily to the model's accuracy. It might be fair to say there is a systematic problem in the awarding system when race and sexual orientation rank among the most important factors in winner prediction; there is real potential for bias based on sexual orientation and race. Factors like duration, IMDB score, budget, revenue, and whether the movie's director, actress, or actor won an award can be considered fair, technical features, and we are glad to see those rank highly as well.
One interesting finding, and something for movie producers to think about: for a movie to win, having a winning director and then a winning actress helped our prediction more than the usually higher-paid male actors did. Could it be time for equal pay?
years, non_white_count, white_count = [], [], []
non_white_races = ['Asian', 'Black', 'Hispanic', 'Middle Eastern', 'Multiracial']
for year in winning_actors.year.unique():
    winning_actors_for_the_year = winning_actors[winning_actors.year == year]
    winning_actress_for_the_year = winning_actresses[winning_actresses.year == year]
    winning_directors_for_the_year = winning_directors[winning_directors.year == year]
    non_white_actors = winning_actors_for_the_year[winning_actors_for_the_year.race_ethnicity.isin(non_white_races)]
    white_actors = winning_actors_for_the_year[winning_actors_for_the_year.race_ethnicity == 'White']
    non_white_actress = winning_actress_for_the_year[winning_actress_for_the_year.race_ethnicity.isin(non_white_races)]
    white_actress = winning_actress_for_the_year[winning_actress_for_the_year.race_ethnicity == 'White']
    non_white_directors = winning_directors_for_the_year[winning_directors_for_the_year.race_ethnicity.isin(non_white_races)]
    white_directors = winning_directors_for_the_year[winning_directors_for_the_year.race_ethnicity == 'White']
    years.append(year)
    non_white_count.append(non_white_actors.shape[0] + non_white_actress.shape[0] + non_white_directors.shape[0])
    white_count.append(white_actors.shape[0] + white_actress.shape[0] + white_directors.shape[0])
fig = go.Figure()
fig.add_trace(go.Scatter(x=years, y=non_white_count,
                         name='Non-White Winners Count Over the Years',
                         mode='lines+markers',
                         connectgaps=True,
                         line=dict(color='firebrick', width=3)))
fig.add_trace(go.Scatter(x=years, y=white_count,
                         name='White Winners Count Over the Years',
                         mode='lines+markers',
                         connectgaps=True,
                         line=dict(color='green', width=3)))
fig.update_layout(title="Comparison of White vs Non-White Winners over Time", xaxis_title="Years", yaxis_title="Winners Count")
fig.show()
The graph above shows award-winner counts by race, with the green line showing white winners and the red line showing non-white winners. Non-white winners are clearly under-represented until about 2000; after 2000 there seems to be an improvement.
Note: We have ignored entries where race data was not present; race data is also unavailable for 2014.
We are interested in whether the awarding system shows any pattern of being affected, positively or negatively, by IMDB ratings and audience pressure.
fig = go.Figure(data=go.Scatter(x=winning_movies.year,
                                y=winning_movies.rating,
                                mode='markers',
                                marker_color=winning_movies.rating,
                                marker=dict(size=winning_movies.rating, showscale=True),
                                text=winning_movies.name))  # hover text
fig.update_layout(title="Winning Movies Rating Over Time", xaxis_title="Years", yaxis_title="IMDB Rating")
fig.show()
On average, Oscar-winning movies have quite high ratings, regularly above 7. This suggests that movies rated above 7 have a higher chance of winning the award. But it also shows that popularity and audience taste are not everything: the Academy Awards are not purely a popularity contest. It is also safe to say that really low-quality movies do not win awards.
We also thought it important to look at what role money plays in this industry. Does spending more always result in a better movie?
fig = go.Figure(data=go.Scatter(x=winning_movies.year,
                                y=winning_movies.budget,
                                mode='markers',
                                marker_color=winning_movies.budget,
                                marker=dict(showscale=True),
                                text=winning_movies.name))  # hover text
fig.update_layout(title="Winning Movies Budget Over Time", xaxis_title="Years", yaxis_title="Movie Budget")
fig.show()
## From the graph above, overall budgets seem to increase after 1995, with a few movies exceeding 50M,
## but movies with budgets under 50M also perform well.
## Below you can see what all ~5000 movies in our IMDB dataset spent on production in general.
## Higher budget spending does not necessarily translate into an Academy Award win.
fig = go.Figure(data=go.Scatter(x=metadata.title_year,
                                y=metadata.budget,
                                mode='markers',
                                marker_color=metadata.budget,
                                marker=dict(showscale=True),
                                text=metadata.movie_title))  # hover text
fig.update_layout(title="Movies Budget Over Time", xaxis_title="Years", yaxis_title="Movie Budget")
fig.show()
The first chart shows the budget, year by year, of each movie that won the award, while the second chart shows the money spent on all movies over the years. Very few movies in the first chart sit above the 50-million line, whereas the second chart shows plenty of movies that spent more than 50 million. Budgets under 50 million usually produced the award-winning movie, suggesting money is not everything in this industry.
Having reviewed a few technical details of the movies, it's probably not a bad idea to dive deeper into factors like race, gender, sexual orientation, and other demographics.
with plt.style.context(('ggplot', 'seaborn')):
    race_ethnicity = []
    for df in [winning_actors, winning_actresses, winning_directors]:
        race_ethnicity.append(pd.get_dummies(df[['race_ethnicity']]))
    race_ethnicity = pd.concat(race_ethnicity)
    ## Please ignore the Na bar; it marks rows where data was not present.
    cols = race_ethnicity.columns.tolist()
    #cols.remove('race_ethnicity_Na')
    race_ethnicity = race_ethnicity[cols]
    cols = [val.replace('race_ethnicity_', '') for val in cols]
    race_counts = race_ethnicity.sum(axis=0)
    race_df = pd.DataFrame({'Race_Cols': cols, 'Race_Counts': race_counts.values})
    race_df = race_df.sort_values(by='Race_Counts')
    plt.figure(figsize=(8, 5))
    plt.bar(x=range(len(race_df)), height=race_df.Race_Counts, color='tab:red')
    plt.xticks(range(len(race_df)), race_df.Race_Cols, rotation='vertical')
    plt.xlabel('race_ethnicity')
    plt.ylabel('Winner Count')
    plt.title('Award Winners Count By Race')
    ## Annotate each bar with its count (counts are non-negative, so labels go above the bars).
    for i in range(len(race_df)):
        plt.text(i, race_df.Race_Counts.iloc[i], '%d' % race_df.Race_Counts.iloc[i],
                 horizontalalignment='center', verticalalignment='bottom')
We can see that white winners are meaningfully more numerous: the majority of Oscar winners in our data are white. The differences among the other races do not appear especially meaningful. We do not see much non-white representation among Academy Award winners (Best Actor, Best Actress, Best Director).
with plt.style.context(('ggplot', 'seaborn')):
    religion = []
    for df in [winning_actors, winning_actresses, winning_directors]:
        religion.append(pd.get_dummies(df[['religion']]))
    religion = pd.concat(religion)
    ## Drop Na, which marks rows where data was not present.
    cols = religion.columns.tolist()
    cols.remove('religion_Na')
    religion = religion[cols]
    cols = [val.replace('religion_', '') for val in cols]
    religion_counts = religion.sum(axis=0)
    religion_df = pd.DataFrame({'religion_Cols': cols, 'religion_Counts': religion_counts.values})
    religion_df = religion_df.sort_values(by='religion_Counts')
    plt.figure(figsize=(8, 5))
    plt.bar(x=range(len(religion_df)), height=religion_df.religion_Counts, color='tab:green')
    plt.xticks(range(len(religion_df)), religion_df.religion_Cols, rotation='vertical')
    plt.xlabel('Religion')
    plt.ylabel('Winner Count')
    plt.title('Award Winners Count By Religion')
    ## Annotate each bar with its count.
    for i in range(len(religion_df)):
        plt.text(i, religion_df.religion_Counts.iloc[i], '%d' % religion_df.religion_Counts.iloc[i],
                 horizontalalignment='center', verticalalignment='bottom')
The chart above shows the religion distribution of Academy Award winners, with Roman Catholics, Jews, and atheists leading the numbers.
with plt.style.context(('ggplot', 'seaborn')):
    sexual_orientation = []
    for df in [winning_actors, winning_actresses, winning_directors]:
        sexual_orientation.append(pd.get_dummies(df[['sexual_orientation']]))
    sexual_orientation = pd.concat(sexual_orientation)
    ## Drop Na, which marks rows where data was not present.
    cols = sexual_orientation.columns.tolist()
    cols.remove('sexual_orientation_Na')
    sexual_orientation = sexual_orientation[cols]
    cols = [val.replace('sexual_orientation_', '') for val in cols]
    sexual_orientation_counts = sexual_orientation.sum(axis=0)
    sexual_orientation_df = pd.DataFrame({'sexual_orientation_Cols': cols, 'sexual_orientation_Counts': sexual_orientation_counts.values})
    sexual_orientation_df = sexual_orientation_df.sort_values(by='sexual_orientation_Counts')
    plt.figure(figsize=(7, 5))
    plt.bar(x=range(len(sexual_orientation_df)), height=sexual_orientation_df.sexual_orientation_Counts, color='tomato')
    ## Use the sorted labels so they line up with the sorted bar heights.
    plt.xticks(range(len(sexual_orientation_df)), sexual_orientation_df.sexual_orientation_Cols, rotation='vertical')
    plt.xlabel('sexual_orientation')
    plt.ylabel('Winner Count')
    plt.title('Award Winners Count By Sexual Orientation')
    ## Annotate each bar with its count.
    for i in range(len(sexual_orientation_df)):
        plt.text(i, sexual_orientation_df.sexual_orientation_Counts.iloc[i], '%d' % sexual_orientation_df.sexual_orientation_Counts.iloc[i],
                 horizontalalignment='center', verticalalignment='bottom')
Nominees who appeared to be straight had the best chance of winning, followed by bisexual nominees. The lower numbers for the gay and lesbian communities deserve some attention.
We also thought it would be interesting to see whether a specific genre has a higher potential of winning the award. A certain genre could match what the judges are looking for, and certain kinds of movies might not be taken seriously.
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(12, 5))
    genre_counts = ml_df[genres_cols].sum(axis=0)
    genre_df = pd.DataFrame({'genre_Cols': genres_cols, 'genre_Counts': genre_counts.values})
    genre_df = genre_df.sort_values(by='genre_Counts')
    plt.bar(x=range(len(genre_df)), height=genre_df.genre_Counts)
    plt.xticks(range(len(genre_df)), genre_df.genre_Cols, rotation='vertical')
    plt.xlabel('Genres')
    plt.ylabel('Winner Count')
    plt.title('Award Winners Count By Genres')
    ## Annotate each bar with its count.
    for i in range(len(genre_df)):
        plt.text(i, genre_df.genre_Counts.iloc[i], '%d' % genre_df.genre_Counts.iloc[i],
                 horizontalalignment='center', verticalalignment='bottom')
The relation above between movie genres and award winners shows that Drama movies have the highest chance of winning, followed by History, War, Adventure, and Crime, which also do quite well. Keep in mind that a movie can have more than one genre.
Below are further stats about winning actors, actresses, directors, and movies.
print('Average Nominations for Winning Actors : %.2f'%np.mean([int(nom) for nom in winning_actors['nominations'] if str(nom).isdigit()]))
print('\nAverage Movie Duration for Winning Actors : %.2f'%winning_actors.duration.mean())
genres = Counter(winning_actors['genre1'].values.tolist()+winning_actors['genre2'].values.tolist())
sorted_genres = list(sorted(genres.items(), key=lambda x: x[1], reverse=True))
print('\nTop 5 Genre for Winning Actors : %s'%sorted_genres[:5])
print('\nAverage Metacritic Ratings for Winning Actors : %.2f'%winning_actors.metacritic.mean())
print('\nAverage Budget of Movies for Winning Actors : %.2f'%winning_actors.budget.mean())
print('\nAverage Gross Income of Movies for Winning Actors : %.2f'%winning_actors.gross.mean())
locations = Counter([loc.split(',')[-1].strip() for loc in winning_actors.birthplace if isinstance(loc, str)])
sorted_locations = list(sorted(locations.items(), key=lambda x: x[1], reverse=True))
print('\nTop 5 BirthPlaces for Winning Actors : %s'%sorted_locations[:5])
print('\nRace Ethnicity Distribution of Winning Actors : %s'%Counter(winning_actors.race_ethnicity))
print('\nSexual Orientation Distribution of Winning Actors : %s'%Counter(winning_actors.sexual_orientation))
print('\nReligion Distribution of Winning Actors %s'%Counter(winning_actors.religion))
actor_to_oscars_cnt = sorted(Counter(winning_actors['person']).items(), key=lambda x: x[1], reverse=True)
print('\nActors who Won Oscars more than Once : %s'%[val for val in actor_to_oscars_cnt if val[1]>1 and isinstance(val[0], str)])
print('Average Nominations for Winning Actresses : %.2f'%np.mean([int(nom) for nom in winning_actresses['nominations'] if str(nom).isdigit()]))
print('\nAverage Movie Duration for Winning Actresses : %.2f'%winning_actresses.duration.mean())
genres = Counter(winning_actresses['genre1'].values.tolist()+winning_actresses['genre2'].values.tolist())
sorted_genres = list(sorted(genres.items(), key=lambda x: x[1], reverse=True))
print('\nTop 5 Genre for Winning Actresses : %s'%sorted_genres[:5])
print('\nAverage Metacritic Ratings for Winning Actresses : %.2f'%winning_actresses.metacritic.mean())
print('\nAverage Budget of Movies for Winning Actresses : %.2f'%winning_actresses.budget.mean())
print('\nAverage Gross Income of Movies for Winning Actresses : %.2f'%winning_actresses.gross.mean())
locations = Counter([loc.split(',')[-1].strip() for loc in winning_actresses.birthplace if isinstance(loc, str)])
sorted_locations = list(sorted(locations.items(), key=lambda x: x[1], reverse=True))
print('\nTop 5 BirthPlaces for Winning Actresses : %s'%sorted_locations[:5])
print('\nRace Ethnicity Distribution of Winning Actresses : %s'%Counter(winning_actresses.race_ethnicity))
print('\nSexual Orientation Distribution of Winning Actresses : %s'%Counter(winning_actresses.sexual_orientation))
print('\nReligion Distribution of Winning Actresses %s'%Counter(winning_actresses.religion))
actress_to_oscars_cnt = sorted(Counter(winning_actresses['person']).items(), key=lambda x: x[1], reverse=True)
print('Actresses who Won Oscars more than Once : %s'%[val for val in actress_to_oscars_cnt if val[1]>1 and isinstance(val[0], str)])
print('Average Nominations for Winning Directors : %.2f'%np.mean([int(nom) for nom in winning_directors['nominations'] if str(nom).isdigit()]))
print('\nAverage Movie Duration for Winning Directors : %.2f'%winning_directors.duration.mean())
genres = Counter(winning_directors['genre1'].values.tolist()+winning_directors['genre2'].values.tolist())
sorted_genres = list(sorted(genres.items(), key=lambda x: x[1], reverse=True))
print('\nTop 5 Genre for Winning Directors : %s'%sorted_genres[:5])
print('\nAverage Metacritic Ratings for Winning Directors : %.2f'%winning_directors.metacritic.mean())
print('\nAverage Budget of Movies for Winning Directors : %.2f'%winning_directors.budget.mean())
print('\nAverage Gross Income of Movies for Winning Directors : %.2f'%winning_directors.gross.mean())
locations = Counter([loc.split(',')[-1].strip() for loc in winning_directors.birthplace if isinstance(loc, str)])
sorted_locations = list(sorted(locations.items(), key=lambda x: x[1], reverse=True))
print('\nTop 5 BirthPlaces for Winning Directors : %s'%sorted_locations[:5])
print('\nRace Ethnicity Distribution of Winning Directors : %s'%Counter(winning_directors.race_ethnicity))
print('\nSexual Orientation Distribution of Winning Directors : %s'%Counter(winning_directors.sexual_orientation))
print('\nReligion Distribution of Winning Directors %s'%Counter(winning_directors.religion))
directors_to_oscars_cnt = sorted(Counter(winning_directors['person']).items(), key=lambda x: x[1], reverse=True)
print('\nDirectors who Won Oscars more than Once : %s'%[val for val in directors_to_oscars_cnt if val[1]>1 and isinstance(val[0], str)])
print('Average Nominations for Winning Movies : %.2f'%np.mean([int(nom) for nom in winning_movies['nominations'] if str(nom).isdigit()]))
print('\nAverage Duration of Winning Movies : %.2f'%winning_movies.duration.mean())
genres = Counter(winning_movies['genre1'].values.tolist()+winning_movies['genre2'].values.tolist())
sorted_genres = list(sorted(genres.items(), key=lambda x: x[1], reverse=True))
print('\nTop 5 Genres for Winning Movies : %s'%sorted_genres[:5])
print('\nAverage Metacritic Rating for Winning Movies : %.2f'%winning_movies.metacritic.mean())
print('\nAverage Budget of Winning Movies : %.2f'%winning_movies.budget.mean())
print('\nAverage Gross Income of Winning Movies : %.2f'%winning_movies.gross.mean())